CURL: Co-trained Unsupervised Representation
Learning for Image Classification

Simone Bianco, Gianluigi Ciocca, and Claudio Cusano 


Abstract —In this paper we propose a strategy for semi- 
supervised image classification that leverages unsupervised 
representation learning and co-training. The strategy, called CURL (Co-trained Unsupervised Representation Learning), iteratively builds two classifiers on two different views of the data. The two views correspond to different representations learned from both labeled and unlabeled data and differ in the fusion scheme used to combine the image features.

To assess the performance of our proposal, we conducted several experiments on widely used data sets for scene and object recognition. We considered three scenarios (inductive, transductive and self-taught learning) that differ in the strategy followed to exploit the unlabeled data. As image features we considered a combination of GIST, PHOG, and LBP as well as features extracted from a Convolutional Neural Network. Moreover, two embodiments of CURL are investigated: one using Ensemble Projection as unsupervised representation learning coupled with Logistic Regression, and one based on LapSVM. The results show that CURL clearly outperforms the other state-of-the-art supervised and semi-supervised learning methods included in the comparison.

Index Terms —Image classification, machine learning 
algorithms, pattern analysis, semi-supervised learning. 

I. Introduction

Semi-supervised learning [1] consists in taking into account both labeled and unlabeled data when training machine learning models. It is particularly effective when there is plenty of training data, but only a few instances are labeled. In recent years, many semi-supervised learning approaches have been proposed, including generative methods [2], [3], graph-based methods [4], [5], and methods based on Support Vector Machines [6], [7]. Co-training is another example of a semi-supervised technique [8]. It consists in training two classifiers independently which, on the basis of their level of confidence on unlabeled data, co-train each other through the identification of good additional training examples. The difference between the two classifiers is that they work on different views of the training data, often corresponding to two feature vectors. Pioneering works on co-training identified the conditional independence between the views as the main reason for its success. More recently, it has been observed that conditional independence is a sufficient, but not necessary, condition, and that even a single view can be considered, provided that different classification techniques are used [9].

In this work we propose a semi-supervised image classification strategy which exploits unlabeled data in two different ways: first, two image representations are obtained by unsupervised representation learning (URL) on a set of image features computed on all the available training data; then, co-training is used to enlarge the labeled training set of the corresponding co-trained classifiers (C). The difference between the two image representations is that one is built on the combination of all the image features (early fusion), while the other is the combination of sub-representations separately built on each feature (late fusion). We call the proposed strategy CURL: Co-trained Unsupervised Representation Learning (from the combination of the C and URL components). The schema of CURL is illustrated in Fig. 1.

In standard co-training each classifier is built on a single view, often corresponding to a single feature. However, the combination of multiple features is often required to recognize complex visual concepts [10]-[12]. Both the classifiers built by CURL exploit all the available image features, so that these concepts can be accurately recognized. We argue that the use of two different fusion schemes, together with the non-linear transformation produced by the unsupervised learning procedure, makes the two image representations uncorrelated enough to allow an effective co-training of the classifiers.

The proposed strategy is built on two base components: URL (the unsupervised representation learning) and C (the classifier used in co-training). By changing these two components we obtain different embodiments of CURL that can be tested and evaluated.

To assess the merits of our proposal we conducted several experiments on widely used data sets: the 15-scene data set, the Caltech-101 object classification data set, and the ILSVRC 2012 data set, which contains 1000 different classes. We considered a variety of scenarios including transductive learning (i.e. unlabeled test data available during training), inductive learning (i.e. test data not available during training), and self-taught learning (i.e. test and training data coming from two different data sets). In order to verify the efficacy of the CURL classification strategy, we also tested two embodiments: one that uses Ensemble Projection unsupervised representation coupled with Logistic Regression classification, and one based on LapSVM semi-supervised classification. Moreover, different variants of the embodiments are evaluated as well. The results show that CURL clearly outperforms the other state-of-the-art semi-supervised learning methods included in the comparison.

Fig. 1. Schema of the proposed strategy.

II. Related Work 

There is a large literature on semi-supervised learning. For the sake of brevity, we discuss only the paradigms involved in the proposed strategy. More information about these and other approaches to semi-supervised learning can be found in the book by Chapelle et al. [1].

A. Co-training 

Blum and Mitchell proposed co-training in 1998 [8] and verified its effectiveness for the classification of web pages. The basic idea is that two classifiers are trained on separate views (features) and then used to train each other. More precisely, when one of the classifiers is very confident in making a prediction for unlabeled data, the predicted labels are used to augment the training set of the other classifier. The concept has been generalized to three [13] or more views [14], [15]. Co-training has been used in several computer vision applications including video annotation [16], action recognition [17], traffic analysis [18], speech and gesture recognition [19], image annotation [20], biometric recognition [21], image retrieval [22], image classification [23], object detection [24], and object tracking [25].

According to Blum and Mitchell, a sufficient condition for the effectiveness of co-training is that, besides being individually accurate, the two classifiers are conditionally independent given the class label. However, conditional independence is not a necessary condition. In fact, Wang and Zhou [26] showed that co-training can be effective when the diversity between the two classifiers is larger than their errors; their results provided theoretical support for the success of single-view co-training variants [27]-[29] (the reader may refer to an updated study from the same authors [30] for more details about necessary and sufficient conditions for co-training).


B. Unsupervised representation learning 

In recent years, as a consequence of the success of deep learning frameworks, we have observed an increased interest in methods that make use of unlabeled data to automatically learn new representations. In fact, these have been demonstrated to be very effective for the pre-training of large neural networks [31], [32]. Restricted Boltzmann Machines [33] and auto-encoder networks [34] are notable examples of this kind of method. The tutorial by Bengio covers this family of approaches in detail [35].

A conceptually simpler approach consists in using clustering algorithms to identify frequently occurring patterns in unlabeled data that can be used to define effective representations. The K-means algorithm has been widely used for this purpose [36]. In computer vision this approach is very popular and has led to the many variants of bag-of-visual-words representations [37], [38]. Briefly, clustering on unlabeled data is used to build a vocabulary of visual words. Given an image, multiple local features are extracted and for each of them the most similar visual word is searched. The final representation is a histogram counting the occurrences of the visual words. Sparse coding can be seen as an extension of this approach, where each local feature is described as a sparse combination of multiple words of the vocabulary [39]-[41].
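The bag-of-visual-words encoding just described can be illustrated with a minimal Python sketch. It assumes a vocabulary that has already been obtained by clustering (the three 2-D words below are made up for illustration) and uses the Euclidean distance to find the most similar word; the function names are hypothetical and do not come from any of the cited works.

```python
import math

def nearest_word(feature, vocabulary):
    """Return the index of the visual word closest to `feature` (Euclidean)."""
    return min(range(len(vocabulary)),
               key=lambda j: math.dist(feature, vocabulary[j]))

def bovw_histogram(local_features, vocabulary):
    """Count how many local features are assigned to each visual word."""
    hist = [0] * len(vocabulary)
    for f in local_features:
        hist[nearest_word(f, vocabulary)] += 1
    return hist

# Toy vocabulary (assumed to come from a clustering step) and toy local features.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [(0.1, -0.2), (0.9, 1.2), (4.0, 6.0), (1.1, 0.8)]
histogram = bovw_histogram(features, vocabulary)
print(histogram)
```

In practice the histogram is usually normalized before being passed to a classifier; that step is omitted here for brevity.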

Another strategy for unsupervised feature learning is represented by Ensemble Projection (EP) [42]. From all the available data (labeled and unlabeled) Ensemble Projection samples a set of prototypes. Discriminative learning is then used to learn projection functions tuned to the prototypes. Since a single set of projections could be too noisy, multiple sets of prototypes are sampled to build an ensemble of projection functions. The values computed according to these functions represent the components of the learned representations.

LapSVM can be seen as an unsupervised representation learning method as well. In this case the learned representation is not explicit: it is implicitly embedded in a kernel learned from unlabeled data.


C. Fusion schemes 


Combining multimodal information is an important issue in pattern recognition. The fusion of multimodal inputs can bring complementary information from various sources, useful for improving the quality of image retrieval and classification [43]. The problem arises in defining how these modalities are to be combined or fused. In general, the existing fusion approaches can be categorized as early and late fusion approaches, which refer to their relative position with respect to the feature comparison or learning step in the whole processing chain. Early fusion usually refers to the combination of the features into a single representation before comparison/learning. Late fusion refers to the combination, at the last stage, of the responses obtained after individual feature comparison or learning [44], [45]. There is no universal conclusion as to which strategy is preferable for a given task. For example, Snoek et al. [44] found that late fusion is better than early fusion in the TRECVID 2004 semantic indexing task, while Ayache et al. [46] stated that early fusion gets better results than late fusion on the TRECVID 2006 semantic indexing task. A combination of these approaches can also be exploited as a hybrid fusion approach [47].

Another form of data fusion is Multiple Kernel Learning (MKL). MKL was introduced by Lanckriet et al. [48] as an extension of support vector machines (SVMs). Instead of using a single kernel computed on the image representation as in standard SVMs, MKL learns distinct kernels. The kernels are combined with a linear or nonlinear function, and the function's parameters can be determined during the learning process. MKL can be used to learn different kernels on the same image representation, or to learn different kernels each on a different image representation [49]. The former corresponds to having different notions of similarity and choosing the most suitable one for the problem and representation at hand. The latter corresponds to having multiple representations, each with a possibly different definition of similarity, that must be combined together. This kind of data fusion is termed intermediate fusion in [45].
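As a small illustration of the linear case, the sketch below combines two base kernels with fixed non-negative weights. This is only an assumption-laden toy: in actual MKL the weights are learned together with the classifier, whereas here they are hand-picked, and the function names are hypothetical.

```python
import math

def linear_kernel(a, b):
    """Dot product between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative value."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def combine_kernels(kernels, weights):
    """Return k(a, b) = sum_m weights[m] * kernels[m](a, b)."""
    def combined(a, b):
        return sum(w * k(a, b) for w, k in zip(weights, kernels))
    return combined

# Hand-picked weights (assumption): a real MKL solver would estimate them.
kernel = combine_kernels([linear_kernel, rbf_kernel], [0.25, 0.75])
value = kernel((1.0, 0.0), (1.0, 0.0))
print(value)
```

The same combiner can mix kernels computed on different image representations by wrapping each base kernel so that it reads only its own slice of the input.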


III. The Proposed Strategy: CURL


In the semi-supervised image classification setup the training data consists of both labeled examples $\{X_l, y\} = \{(x_i, y_i)\}_{i=1}^{L}$ and unlabeled ones $X_u = \{x_i\}_{i=L+1}^{L+U}$, where $x_i$ denotes the feature vector of image $i$, $y_i \in \{1, \dots, K\}$ is its label, and $K$ is the number of classes.

In this work, for each image $i$ a set of $S$ different image features $x_i^{(s)}$, $s = 1, \dots, S$, is considered. Two views are then generated by using two different fusion strategies: early and late fusion. In the case of Early Fusion (EF), the image features are concatenated and then used to learn a new representation $x_i^{EF} = \varphi([x_i^{(1)}, \dots, x_i^{(S)}])$ in an unsupervised way, where $\varphi(\cdot)$ is a projection function. In the case of Late Fusion (LF), an unsupervised representation $\varphi_s$ is independently learned for each image feature and then the representations are concatenated to obtain $x_i^{LF} = [\varphi_1(x_i^{(1)}), \dots, \varphi_S(x_i^{(S)})]$.

Using the learned EF and LF unsupervised representations, the two views are built: $X^{EF} = \{x_i^{EF}\}_{i=1}^{L+U}$ and $X^{LF} = \{x_i^{LF}\}_{i=1}^{L+U}$. In each view, the first $L$ elements form the labeled subsets $X_l^{EF}$ and $X_l^{LF}$, and the remaining $U$ elements form the unlabeled subsets $X_u^{EF}$ and $X_u^{LF}$.

Furthermore, two label sets $y^{EF}$ and $y^{LF}$ are initialized equal to $y$.
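The difference between the two fusion schemes can be sketched in a few lines of Python. The representation learner below is only a toy stand-in (similarities to randomly sampled prototypes) and is not Ensemble Projection or any other method used in the paper; all function names and the number of prototypes are assumptions made for illustration.

```python
import math
import random

def learn_projection(data, n_prototypes, rng):
    """Toy stand-in for unsupervised representation learning (an assumption):
    sample prototypes from the data and map x to its similarities to them."""
    prototypes = rng.sample(data, n_prototypes)
    def project(x):
        return [math.exp(-math.dist(x, p) ** 2) for p in prototypes]
    return project

def early_fusion(features, n_prototypes, rng):
    """Concatenate the S features of each image, then learn one projection."""
    n_images = len(features[0])
    concat = [sum((list(f[i]) for f in features), []) for i in range(n_images)]
    phi = learn_projection(concat, n_prototypes, rng)
    return [phi(x) for x in concat]

def late_fusion(features, n_prototypes, rng):
    """Learn one projection per feature, then concatenate the projections."""
    n_images = len(features[0])
    projected = []
    for f in features:
        phi_s = learn_projection(f, n_prototypes, rng)
        projected.append([phi_s(x) for x in f])
    return [sum((p[i] for p in projected), []) for i in range(n_images)]

# Two toy features (S = 2) for four images, labeled and unlabeled together.
feat_a = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
feat_b = [[0.5], [0.4], [0.1], [0.0]]
rng = random.Random(0)
X_ef = early_fusion([feat_a, feat_b], n_prototypes=2, rng=rng)
X_lf = late_fusion([feat_a, feat_b], n_prototypes=2, rng=rng)
```

In this toy version the late-fused vectors are S times longer than the early-fused ones; a real implementation may size the learners differently so that the two views have comparable dimensionality.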

Once the two views are generated, our method iteratively co-trains two classifiers $\phi_{EF}$ and $\phi_{LF}$ on them. SVMs, logistic regressions, or any other similar technique can be used to obtain them. The idea of iterative co-training is that one can use a small labeled sample to train the initial classifiers over the respective views (i.e. $\phi_{EF}: X_l^{EF} \mapsto y^{EF}$ and $\phi_{LF}: X_l^{LF} \mapsto y^{LF}$), and then iteratively bootstrap by taking unlabeled examples for which one of the classifiers is confident but the other is not. The confident classifier determines pseudo-labels [50] that are then used as if they were true labels to improve the other classifier [51]. Given the classifier confidence scores $w_i^{EF} = \phi_{EF}(x_i^{EF})$ and $w_i^{LF} = \phi_{LF}(x_i^{LF})$, the pseudo-labels $\hat{y}_i^{EF}$ and $\hat{y}_i^{LF}$ are respectively obtained as:

$\hat{y}_i^{EF} = \arg\max_{j=1,\dots,K} w_i^{EF}[j]$    (1)

$\hat{y}_i^{LF} = \arg\max_{j=1,\dots,K} w_i^{LF}[j]$    (2)

In each round of co-training, the classifier $\phi_{LF}$ chooses some examples in $X_u^{LF}$ to pseudo-label for $\phi_{EF}$, and vice versa. For each class $k$, let us call $X_k$ the set of candidate unlabeled examples to be pseudo-labeled for $\phi_{EF}$. Each $x_i \in X_k$ must belong to the unlabeled set, i.e. $x_i \in X_u^{LF}$, must not have been already used for training, i.e. $x_i \notin X_l^{EF}$, and its pseudo-label has to be $\hat{y}_i^{LF} = k$. Furthermore, $\phi_{LF}$ should be more confident on the classification of $x_i$ than $\phi_{EF}$, and its confidence should be higher than a fixed threshold $t_1$:

$\forall x_i \in X_k : w_i^{LF}[k] > t_1$    (3)

If no $x_i$ satisfying Eq. (3) is found, then the constraints are relaxed:

$\forall x_i \in X_k : w_i^{LF}[k] > t_2$, with $t_2 < t_1$    (4)

Non-maximum suppression is applied to add one single pseudo-labeled example for each class by extracting the most confident $x_i \in X_k$:

find $x_i \in X_k$ : $w_i^{LF}[k] = \max_{j} w_j^{LF}[k]$    (5)

The selected $x_i$ and its corresponding pseudo-label are added to $X_l^{EF}$ and $y^{EF}$ respectively. If no $x_i$ satisfying Eq. (4) is found, then nothing is added to $X_l^{EF}$ and $y^{EF}$.

Similarly, the classifier $\phi_{EF}$ chooses some examples in $X_u^{EF}$ to pseudo-label for $\phi_{LF}$. At the next co-training round, two new classifiers $\phi_{EF}$ and $\phi_{LF}$ are trained on the respective views, which now contain both labeled and pseudo-labeled examples. The complete procedure of the CURL method is outlined in Algorithms 1-3.
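The per-class selection rule (thresholding, relaxation, and non-maximum suppression) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: confidence scores are given as plain lists, `used` holds the indices already added to the other view's training set, and the threshold values are hypothetical since they are not reported in this section.

```python
def select_pseudo_label(k, w_conf, w_other, used, t1, t2):
    """Pick at most one unlabeled example of class k to pseudo-label.

    w_conf[i][j]  : score of the confident view for example i, class j
    w_other[i][j] : score of the view that will receive the pseudo-label
    Returns the index of the selected example, or None.
    """
    def predicted_class(i):
        return max(range(len(w_conf[i])), key=lambda j: w_conf[i][j])

    # Unused examples whose pseudo-label (argmax of the scores) equals k.
    pool = [i for i in range(len(w_conf))
            if i not in used and predicted_class(i) == k]
    # Strict rule, Eq. (3): more confident than the other view and above t1.
    candidates = [i for i in pool
                  if w_conf[i][k] > w_other[i][k] and w_conf[i][k] > t1]
    if not candidates:
        # Relaxed rule, Eq. (4): only the lower threshold t2 < t1 is required.
        candidates = [i for i in pool if w_conf[i][k] > t2]
    if not candidates:
        return None
    # Non-maximum suppression, Eq. (5): keep the single most confident example.
    return max(candidates, key=lambda i: w_conf[i][k])

# Toy scores for four unlabeled examples and two classes (made-up numbers).
w_lf = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
w_ef = [[0.7, 0.3], [0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]
picked_class0 = select_pseudo_label(0, w_lf, w_ef, used=set(), t1=0.8, t2=0.5)
picked_class1 = select_pseudo_label(1, w_lf, w_ef, used=set(), t1=0.8, t2=0.5)
```

A full co-training round would call this function once per class and per direction (LF labeling for EF, and EF labeling for LF), add the selected examples to the receiving training set, and retrain both classifiers.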

IV. Experiments 

CURL is parametric with respect to the projection function $\varphi$ used in the unsupervised representation learning URL, and the supervised classification technique C used to co-train $\phi_{EF}$ and $\phi_{LF}$. As a first embodiment of CURL, we used Ensemble Projection [42] for the former and logistic regression for the latter. Another embodiment, based on LapSVM, is presented in Section V-C.


A. Data sets 


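We evaluated our method on two data sets: Scene-15 (S-15) and Caltech-101 (C-101). The Scene-15 data set contains 4485 images divided into 15 scene categories with both indoor and outdoor environments. Each category has 200 to 400 images. Caltech-101 contains 8677 images divided into 101 object categories, each having 31 to 800 images. Furthermore, we collected a set of random images by sampling 20,000 images from the ImageNet data set to evaluate our method on the task of self-taught image classification. Since the current version of ImageNet has 21841 synsets (i.e. categories) and a total of more than 14 million images, there is only a small probability that the random images and the images in the two considered data sets come from the same distribution.

Algorithm 1: CURL

Data: Labeled data $\{X_l, y\}$, unlabeled data $X_u$
Result: Classifiers $\phi_{EF}$ and $\phi_{LF}$
begin
    $[X_l^{EF}, X_u^{EF}, X_l^{LF}, X_u^{LF}]$ = computeURL($X_l$, $X_u$)
    $y^{EF} \leftarrow y$, $y^{LF} \leftarrow y$
    train classifier $\phi_{EF} : X_l^{EF} \mapsto y^{EF}$
    train classifier $\phi_{LF} : X_l^{LF} \mapsto y^{LF}$
    for co-training round $c = 1 : C$ do
        initialize $W^{EF} = \hat{y}^{EF} = W^{LF} = \hat{y}^{LF} = \emptyset$
        foreach $x_i^{EF} \in X_u^{EF}$ do
            add $w_i^{EF} = \phi_{EF}(x_i^{EF})$ to $W^{EF}$
            add $\hat{y}_i^{EF} = \arg\max_{j=1,\dots,K} w_i^{EF}[j]$ to $\hat{y}^{EF}$
        foreach $x_i^{LF} \in X_u^{LF}$ do
            add $w_i^{LF} = \phi_{LF}(x_i^{LF})$ to $W^{LF}$
            add $\hat{y}_i^{LF} = \arg\max_{j=1,\dots,K} w_i^{LF}[j]$ to $\hat{y}^{LF}$
        for class number $k = 1 : K$ do
            for $(v_1, v_2) \in \{(EF, LF), (LF, EF)\}$ do
                find $\{X_k, \hat{y}_k\}$ with $X_k \subset X_u^{v_1}$, $X_k \cap X_l^{v_2} = \emptyset$, $\hat{y}_k \subset \hat{y}^{v_1}$ s.t.:
                    $\forall x_i \in X_k$ and $\hat{y}_i \in \hat{y}_k$ hold:
                    $\hat{y}_i^{v_1} = k$, $w_i^{v_2}[k] < w_i^{v_1}[k]$, $w_i^{v_1}[k] > t_1$
                if $X_k = \emptyset$ then
                    find $\{X_k, \hat{y}_k\}$ with $X_k \subset X_u^{v_1}$, $X_k \cap X_l^{v_2} = \emptyset$, $\hat{y}_k \subset \hat{y}^{v_1}$ s.t.:
                        $\forall x_i \in X_k$ and $\hat{y}_i \in \hat{y}_k$ hold:
                        $\hat{y}_i^{v_1} = k$, $w_i^{v_1}[k] > t_2$
                $[X_k, \hat{y}_k]$ = nonMaxSuppr($X_k$, $\hat{y}_k$, $W^{v_1}$, $k$)
                $X_l^{v_2} = X_l^{v_2} \cup X_k$
                $y^{v_2} = y^{v_2} \cup \hat{y}_k$
        train classifier $\phi_{EF} : X_l^{EF} \mapsto y^{EF}$
        train classifier $\phi_{LF} : X_l^{LF} \mapsto y^{LF}$

Algorithm 2: computeURL

Data: Labeled data $X_l$ and unlabeled data $X_u$
Result: Unsup. representations $X_l^{EF}$, $X_u^{EF}$, $X_l^{LF}$, $X_u^{LF}$
begin
    Learn EF representation $\varphi$ on $\{[x_i^{(1)}, \dots, x_i^{(S)}]\}_{i=1}^{L+U}$
    $X_l^{EF} = \{\varphi([x_i^{(1)}, \dots, x_i^{(S)}])\}_{i=1}^{L}$, $X_u^{EF} = \{\varphi([x_i^{(1)}, \dots, x_i^{(S)}])\}_{i=L+1}^{L+U}$
    Learn LF representations $\varphi_s$ on $\{x_i^{(s)}\}_{i=1}^{L+U}$, $s = 1, \dots, S$
    $X_l^{LF} = \{[\varphi_1(x_i^{(1)}), \dots, \varphi_S(x_i^{(S)})]\}_{i=1}^{L}$, $X_u^{LF} = \{[\varphi_1(x_i^{(1)}), \dots, \varphi_S(x_i^{(S)})]\}_{i=L+1}^{L+U}$

Algorithm 3: non-maximum suppression

Data: $X$, $\hat{y}$, $W$, $k$
Result: the most confident example $x_i$ and its pseudo-label $\hat{y}_i$
begin
    find $\{x_i, \hat{y}_i\}$ with $x_i \in X$, $\hat{y}_i \in \hat{y}$ s.t.:
        $w_i[k] = \max_{j} w_j[k]$, with $w_j \in W$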


B. Image features 

In our experiments we used the following three features: GIST, Pyramid of Histogram of Oriented Gradients (PHOG), and Local Binary Patterns (LBP). GIST was computed on the images rescaled to 256x256 pixels, at 3 scales with 4, 8 and 8 orientations respectively. PHOG was computed with a 2-layer pyramid and in 8 directions. Uniform LBP with radius equal to 1 and 8 neighbors was used.

In Section V-B we also investigate the use of features extracted from a CNN in combination with the previous ones.


C. Ensemble projection 

Differently from other semi-supervised methods that train a classifier from labeled data with a regularization term learned from unlabeled data, Ensemble Projection [42] learns a new image representation from all known data (i.e. labeled and unlabeled data), and then trains a plain classifier on it.

Ensemble Projection learns knowledge from $T$ different prototype sets $P^t = \{(s_i^t, c_i^t)\}_{i=1}^{rn}$ with $t \in \{1, \dots, T\}$, where $s_i^t \in \{1, \dots, L+U\}$ is the index of the $i$-th chosen image and $c_i^t \in \{1, \dots, r\}$ is the pseudo-label indicating to which prototype $s_i^t$ belongs. $r$ is the number of prototypes in $P^t$, and $n$ is the number of images sampled for each prototype. For each prototype set, $m$ hypotheses are randomly sampled, and the one containing images having the largest mutual distance is kept.

A set of discriminative classifiers $\phi^t(\cdot)$ is learned on $P^t$, one for each prototype set, and the projected vectors $\phi^t(x_i)$ are obtained. The final feature vector is obtained by concatenating these projected vectors.

Following [42] we set $T = 300$, $r = 30$, $n = 6$, $m = 50$, using Logistic Regression (LR) as discriminative classifier $\phi^t(\cdot)$ with $C = 15$.

Within CURL, Ensemble Projection is used to learn both Early Fusion and Late Fusion unsupervised representations. In the case of Early Fusion (EF), the feature vector is obtained by concatenating the $S$ different features available, $x_i = [x_i^{(1)}, \dots, x_i^{(S)}]$. In the case of Late Fusion (LF), the feature vector $x_i$ is made by considering just one single feature at a time, $x_i = x_i^{(s)}$, $s = 1, \dots, S$. For both EF and LF, the same number $T$ of prototype sets is used in order to ensure that the unsupervised representations have the same size.
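The hypothesis-selection step of the prototype sampling can be sketched as below. The sketch makes two explicit assumptions that the text does not fix: "mutual distance" is measured as the sum of pairwise Euclidean distances, and each hypothesis consists of r seed images drawn uniformly at random. How the n images per prototype are then gathered, and how the discriminative classifiers are trained, is not covered here.

```python
import itertools
import math
import random

def sample_prototype_seeds(data, r, m, rng):
    """Draw m random hypotheses of r image indices and keep the one whose
    images are most spread out (sum of pairwise distances, an assumption)."""
    best, best_spread = None, -1.0
    for _ in range(m):
        hypothesis = rng.sample(range(len(data)), r)
        spread = sum(math.dist(data[a], data[b])
                     for a, b in itertools.combinations(hypothesis, 2))
        if spread > best_spread:
            best, best_spread = hypothesis, spread
    return best

# Toy 1-D "images"; the most distant pair is the first and the last one.
data = [[0.0], [1.0], [2.0], [10.0]]
seeds = sample_prototype_seeds(data, r=2, m=200, rng=random.Random(0))
print(sorted(seeds))
```

Repeating this procedure T times with different random draws would give the T prototype sets from which the ensemble of projections is built.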


D. Experimental settings 

We conducted two kinds of experiments: (1) comparison of our strategy with competing methods for semi-supervised image classification; (2) evaluation of our method for different numbers of co-training rounds. We considered three scenarios corresponding to three different ways of using unlabeled data. In the inductive learning scenario 25% of the unlabeled data is used together with the labeled data for the semi-supervised training of the classifier; the remaining 75% is used as an independent test set. In the transductive learning scenario all the unlabeled data is used during both training and test. In the self-taught learning scenario the set of unlabeled data is taken from an additional data set featuring a different distribution of image content (i.e. the 20,000 images from ImageNet); all the unlabeled data from the original data set is used as an independent test set.
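The three ways of using the unlabeled data can be summarized with a small sketch. The function and argument names are hypothetical; only the 25%/75% proportion and the roles of the sets come from the description above.

```python
import random

def split_scenario(labeled, unlabeled, scenario, extra=None, rng=None):
    """Return (labeled, unlabeled_for_training, test_set) for a scenario."""
    rng = rng or random.Random(0)
    if scenario == "inductive":
        pool = list(unlabeled)
        rng.shuffle(pool)
        cut = len(pool) // 4              # 25% used for semi-supervised training
        return labeled, pool[:cut], pool[cut:]
    if scenario == "transductive":
        # The same unlabeled data is seen during training and used for testing.
        return labeled, list(unlabeled), list(unlabeled)
    if scenario == "self-taught":
        # Unlabeled training data comes from an additional data set.
        return labeled, list(extra), list(unlabeled)
    raise ValueError(f"unknown scenario: {scenario}")

labeled_ids = list(range(5))
unlabeled_ids = list(range(100, 200))
extra_ids = list(range(1000, 1020))       # stand-in for the extra image pool
_, ind_train, ind_test = split_scenario(labeled_ids, unlabeled_ids, "inductive")
_, trd_train, trd_test = split_scenario(labeled_ids, unlabeled_ids, "transductive")
_, st_train, st_test = split_scenario(labeled_ids, unlabeled_ids, "self-taught",
                                      extra=extra_ids)
```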

As evaluation measure we followed [42] and used the multi-class average precision (MAP), computed as the average precision over all recall values and over all classes. Different numbers of training images per class were tested for both Scene-15 and Caltech-101 (i.e. 1, 2, 3, 5, 10, 15, and 20). All the reported results represent the average performance over ten runs with random labeled-unlabeled splits.
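One common way to compute this measure is sketched below: for each class the test images are ranked by the classifier score, precision is averaged over the ranks of the positive images, and the per-class values are then averaged. The exact variant used in the experiments (for instance, how recall levels are interpolated) is not specified here, so this should be read as an assumption.

```python
def average_precision(scores, is_positive):
    """Non-interpolated AP: mean of precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if is_positive[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(score_matrix, labels, n_classes):
    """Average the one-vs-rest AP over all classes."""
    aps = []
    for k in range(n_classes):
        class_scores = [row[k] for row in score_matrix]
        aps.append(average_precision(class_scores, [y == k for y in labels]))
    return sum(aps) / n_classes

# Toy example with four test images and two classes (made-up scores).
score_matrix = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
labels = [0, 1, 0, 1]
map_value = mean_average_precision(score_matrix, labels, n_classes=2)
print(round(map_value, 4))
```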

The performance of the proposed strategy is compared with that of other supervised and semi-supervised baseline methods. As supervised classifiers we considered Support Vector Machines (SVM). As semi-supervised classifiers, we used LapSVM [58]. LapSVM extends the SVM framework by including a smoothness penalty term defined on the Laplacian adjacency graph built from both labeled and unlabeled data. For both SVM and LapSVM we experimented with the linear, RBF and $\chi^2$ kernels computed on the concatenation of the three available image features as in [42]. The parameters of SVM and LapSVM have been determined by a greedy search with a three-fold cross validation on the training set. We also compared the present embodiment of CURL against Ensemble Projection coupled with a logistic regression classifier (EP+LR) as in [42].


TABLE I

Mean Average Precision (MAP) of the baseline algorithms, varying the number of labeled images per class in the three learning scenarios considered: inductive (IND), transductive (TRD), and self-taught (ST).


V. Experimental results 

As a first experiment we compared CURL against EP+LR, and against SVMs and LapSVMs with different kernels. Specifically, we tested the two co-trained classifiers operating on early-fused and late-fused representations, both employing EP for URL and LR as classifier C, that we call CURL-EF(EP+LR) and CURL-LF(EP+LR) respectively. We also included a variant of the proposed method. It differs in the number of pseudo-labeled examples that are added at each co-training round. The variant skips the non-maximum suppression step and, at each round, adds all the examples satisfying Eq. (3). We denote the two co-trained classifiers of the variant as CURL-EF_n(EP+LR) and CURL-LF_n(EP+LR).

Fig. 2 shows the classification performance with different numbers of labeled training images per class, in the three learning scenarios for both the Scene-15 and Caltech-101 data sets. For the CURL-based methods we considered five co-training rounds, and the reported performance corresponds to the last round. For SVM and LapSVM only the results using the $\chi^2$ kernel are reported, since they consistently showed the best performance across all the experiments. Detailed results for all the tested baseline methods, and for the CURL variants across the co-training rounds, are available in Tables I and II.

The behavior of the methods is quite stable with respect to the three learning scenarios, with slightly lower MAP obtained in the case of self-taught learning. Our strategy outperformed the other state-of-the-art methods included in the comparison across all the data sets and all the scenarios considered. Among the variants considered, CURL-LF(EP+LR) proved to be the best in the case of a small number of labeled images, while CURL-LF_n(EP+LR) obtained the best results when more labeled data is available. Classifiers obtained on early-fused representations performed generally worse than the corresponding ones obtained on late-fused representations, but they are still uniformly better than the original EP+LR Ensemble Projection, which can be considered as their non-co-trained version. SVMs and LapSVMs performed poorly on the Scene-15 data set, but they outperformed EP+LR and some of the CURL variants on the Caltech-101 data set.


                         Scene-15              Caltech-101
# img  method         IND    TRD    ST       IND    TRD    ST
1      SVM_lin        22.7   22.3   22.3     6.0    6.3    6.3
1      SVM_rbf        26.5   25.8   25.8     7.1    7.3    7.3
1      SVM_χ²         29.5   28.7   28.7     9.5    9.6    9.6
1      LapSVM_lin     28.1   29.1   26.4     9.2    9.6    9.1
1      LapSVM_rbf     28.8   29.8   26.7     9.8    10.2   9.6
1      LapSVM_χ²      32.3   33.7   30.2     10.2   10.7   10.0
1      EP+LR          38.3   39.3   32.4     8.3    8.7    8.4
2      SVM_lin        27.4   26.9   26.9     9.1    9.2    9.2
2      SVM_rbf        32.9   31.3   31.3     10.1   9.8    9.8
2      SVM_χ²         35.4   34.9   34.9     14.1   13.7   13.7
2      LapSVM_lin     33.7   34.9   31.2     12.3   12.8   12.1
2      LapSVM_rbf     34.6   35.7   32.5     12.7   13.3   12.4
2      LapSVM_χ²      38.1   39.6   35.7     14.6   14.9   14.5
2      EP+LR          44.6   47.3   41.0     12.6   13.1   12.5
3      SVM_lin        30.0   30.2   30.2     10.7   10.8   10.8
3      SVM_rbf        36.5   36.7   36.7     11.7   11.6   11.6
3      SVM_χ²         39.9   39.3   39.3     16.7   16.3   16.3
3      LapSVM_lin     37.6   38.6   36.4     13.8   14.3   13.5
3      LapSVM_rbf     37.7   38.9   37.2     14.0   14.6   13.9
3      LapSVM_χ²      42.8   43.9   41.6     17.9   18.3   17.6
3      EP+LR          50.8   53.2   48.5     15.5   15.7   15.3
5      SVM_lin        35.5   35.2   35.2     13.4   13.3   13.3
5      SVM_rbf        43.5   42.8   42.8     14.9   14.6   14.6
5      SVM_χ²         46.5   46.1   46.1     20.7   20.5   20.5
5      LapSVM_lin     43.4   44.5   43.5     16.3   16.6   16.0
5      LapSVM_rbf     43.8   44.7   44.0     16.8   17.1   16.6
5      LapSVM_χ²      49.2   50.5   49.3     21.7   22.1   21.4
5      EP+LR          55.6   58.6   55.2     19.4   20.0   19.5
10     SVM_lin        43.8   43.7   43.7     -      17.3   17.3
10     SVM_rbf        51.9   51.3   51.3     -      19.2   19.2
10     SVM_χ²         55.5   55.2   55.2     -      26.0   26.0
10     LapSVM_lin     51.4   52.5   52.3     -      20.5   20.1
10     LapSVM_rbf     52.4   53.1   53.0     -      21.6   21.2
10     LapSVM_χ²      56.6   57.4   57.2     -      28.1   27.5
10     EP+LR          62.8   64.4   62.9     -      26.0   25.2
15     SVM_lin        50.1   50.3   50.3     -      19.7   19.7
15     SVM_rbf        57.4   57.1   57.1     -      22.1   22.1
15     SVM_χ²         60.7   60.6   60.6     -      29.1   29.1
15     LapSVM_lin     55.1   56.0   55.5     -      22.9   22.5
15     LapSVM_rbf     57.8   58.7   58.3     -      24.0   23.5
15     LapSVM_χ²      60.9   61.6   61.2     -      31.3   30.9
15     EP+LR          66.0   67.9   67.3     -      29.5   29.0
20     SVM_lin        54.0   53.5   53.5     -      21.5   21.5
20     SVM_rbf        60.4   60.3   60.3     -      24.2   24.2
20     SVM_χ²         64.2   64.1   64.1     -      31.5   31.5
20     LapSVM_lin     58.3   58.7   58.6     -      24.6   24.0
20     LapSVM_rbf     60.8   61.4   61.2     -      26.8   26.1
20     LapSVM_χ²      64.5   65.4   65.1     -      33.4   32.7
20     EP+LR          67.8   69.7   69.1     -      31.9   31.0


Co-training makes it possible to exploit the early fusion representations, which otherwise lead to worse results than late fusion representations. In our opinion this happens because the two views capture different relationships among data. This fact is visible in Fig. 3, which shows 2D projections obtained by applying the t-SNE method to the GIST, PHOG, and LBP features, their concatenation, and their learned early- and late-fused representations. Unsupervised representation learning allows t-SNE to identify groups of images of the same class. Moreover, representations based on early and late fusion induce different relationships among the classes. For instance, in the second row of Fig. 3, one of the two learned representations places the blue and the light green classes close to each other on the bottom right; in the other, instead, the two classes are well separated. The difference in the two representations explains the effectiveness of co-training and justifies the difference in performance between CURL-EF(EP+LR) and CURL-LF(EP+LR).

Fig. 2. Mean Average Precision (MAP) varying the number of labeled images per class, obtained on the Scene-15 data set (first row), and on the Caltech-101 data set (second row). Three scenarios are considered: inductive learning (left column), transductive learning (middle column) and self-taught learning (right column). Note that inductive learning on the Caltech-101 data set is limited to 5 labeled images per class because otherwise for some classes there would not be enough unlabeled data left for both training and evaluation.

Fig. 3. t-SNE 2D projections for the different features used. They are relative to the Scene-15 (top row) and Caltech-101 (bottom row) data sets. Different classes are represented in different colors, and the same class with the same color across the row.

As a further investigation, we also combined the two classifiers produced by the co-training procedure, obtaining two other variants of CURL that we denote as CURL-EF&LF(EP+LR) and CURL-EF&LF_n(EP+LR). However, in our experiments, these variants did not yield any significant improvement when compared to CURL-LF(EP+LR).















































TABLE II

Mean Average Precision (MAP) of the CURL variants, in the (EP+LR) embodiment, varying the number of labeled images per class at the different co-training rounds obtained on the Scene-15 data set in the three learning scenarios considered: inductive (left), transductive (middle), and self-taught (right). For clarity, the (EP+LR) suffixes have been omitted.


Inductive
                           # co-train round
# img  method           0      1      2      3      4      5
1      CURL-EF          38.3   41.0   41.1   41.1   41.1   41.2
1      CURL-LF          40.0   42.4   43.8   44.0   44.3   44.4
1      CURL-EF&LF       -      43.3   43.7   43.8   44.0   44.0
1      CURL-EF_n        38.3   38.4   38.5   38.4   38.3   38.2
1      CURL-LF_n        40.0   39.9   39.9   39.7   39.6   39.4
1      CURL-EF&LF_n     -      40.4   40.4   40.3   40.1   40.0
2      CURL-EF          44.6   46.3   46.4   46.5   46.6   46.6
2      CURL-LF          48.5   50.2   50.7   50.9   51.1   51.3
2      CURL-EF&LF       -      49.7   50.0   50.1   50.2   50.3
2      CURL-EF_n        44.6   44.3   44.7   44.7   44.5   44.2
2      CURL-LF_n        48.5   48.8   48.8   48.4   48.0   47.3
2      CURL-EF&LF_n     -      48.0   48.0   47.8   47.4   46.9
3      CURL-EF          50.8   52.1   52.2   52.3   52.4   52.4
3      CURL-LF          54.1   55.2   56.2   56.4   56.6   56.8
3      CURL-EF&LF       -      55.3   55.7   55.9   56.0   56.1
3      CURL-EF_n        50.8   51.6   52.5   52.6   52.0   51.8
3      CURL-LF_n        54.1   55.3   55.3   55.1   54.7   54.2
3      CURL-EF&LF_n     -      55.0   55.2   55.0   54.4   54.0
5      CURL-EF          55.6   56.1   56.2   56.2   56.2   56.2
5      CURL-LF          58.3   59.0   59.3   59.5   59.5   59.6
5      CURL-EF&LF       -      59.0   59.1   59.2   59.3   59.4
5      CURL-EF_n        55.6   56.5   57.7   57.7   57.7   57.6
5      CURL-LF_n        58.3   59.7   59.8   60.0   59.8   59.8
5      CURL-EF&LF_n     -      59.6   59.9   59.9   59.7   59.6
10     CURL-EF          62.8   63.1   63.2   63.2   63.2   63.2
10     CURL-LF          66.2   66.5   66.6   66.9   67.0   67.1
10     CURL-EF&LF       -      66.3   66.4   66.5   66.6   66.7
10     CURL-EF_n        62.8   64.7   65.3   65.5   65.8   65.8
10     CURL-LF_n        66.2   67.6   67.8   68.2   68.2   68.3
10     CURL-EF&LF_n     -      67.5   67.8   68.0   68.0   68.0
15     CURL-EF          66.0   66.1   66.1   66.1   66.1   66.1
15     CURL-LF          69.2   69.4   69.6   69.8   69.8   69.9
15     CURL-EF&LF       -      69.3   69.3   69.4   69.4   69.5
15     CURL-EF_n        66.0   67.6   68.3   68.5   68.7   68.8
15     CURL-LF_n        69.2   70.6   70.9   71.0   71.0   70.9
15     CURL-EF&LF_n     -      70.4   70.8   70.8   70.9   70.9
20     CURL-EF          67.8   67.9   67.9   67.9   67.9   67.9
20     CURL-LF          71.1   71.3   71.3   71.5   71.5   71.6
20     CURL-EF&LF       -      71.2   71.2   71.2   71.2   71.3
20     CURL-EF_n        67.8   69.2   69.8   70.1   70.2   70.3
20     CURL-LF_n        71.1   72.0   72.4   72.5   72.5   72.5
20     CURL-EF&LF_n     -      71.9   72.2   72.4   72.4   72.4


Transductive scenario

                         # co-train round
# img  method             0     1     2     3     4     5
1      CURL-EF         39.3  41.7  41.9  41.9  41.9  41.9
       CURL-LF         39.8  42.6  43.8  44.1  44.2  44.4
       CURL-EF&LF         -  43.5  43.9  44.1  44.1  44.2
       CURL-EF_n       39.3  38.9  38.9  38.9  38.7  38.7
       CURL-LF_n       39.8  39.9  39.9  39.7  39.6  39.3
       CURL-EF&LF_n       -  40.4  40.4  40.4  40.2  40.0

2      CURL-EF         47.3  49.2  49.5  49.6  49.7  49.7
       CURL-LF         48.7  50.6  51.5  51.9  52.0  52.1
       CURL-EF&LF         -  51.2  51.7  51.9  52.0  52.1
       CURL-EF_n       47.3  47.4  47.8  47.5  47.1  46.9
       CURL-LF_n       48.7  49.3  49.0  48.8  48.5  47.8
       CURL-EF&LF_n       -  49.6  49.5  49.2  48.8  48.2

3      CURL-EF         53.2  54.4  54.5  54.6  54.7  54.7
       CURL-LF         54.8  55.8  56.6  57.2  57.4  57.5
       CURL-EF&LF         -  56.4  56.8  57.2  57.3  57.4
       CURL-EF_n       53.2  53.4  53.7  53.5  52.7  52.2
       CURL-LF_n       54.8  55.4  55.4  54.9  54.4  53.7
       CURL-EF&LF_n       -  55.8  55.7  55.2  54.5  53.9

5      CURL-EF         58.6  59.1  59.2  59.2  59.2  59.2
       CURL-LF         60.3  61.3  61.6  61.7  61.9  61.9
       CURL-EF&LF         -  61.5  61.6  61.7  61.8  61.8
       CURL-EF_n       58.6  59.6  60.4  60.3  60.0  59.7
       CURL-LF_n       60.3  61.8  62.0  61.9  61.7  61.5
       CURL-EF&LF_n       -  61.9  62.1  61.9  61.6  61.3

10     CURL-EF         64.4  64.6  64.6  64.6  64.7  64.7
       CURL-LF         66.6  67.0  67.2  67.3  67.5  67.6
       CURL-EF&LF         -  67.1  67.1  67.2  67.3  67.4
       CURL-EF_n       64.4  66.2  67.0  67.3  67.4  67.5
       CURL-LF_n       66.6  68.5  68.8  69.0  69.0  69.0
       CURL-EF&LF_n       -  68.4  68.8  69.0  69.0  69.0

15     CURL-EF         67.9  68.0  68.0  68.0  68.0  68.0
       CURL-LF         70.1  70.5  70.6  70.7  70.8  71.0
       CURL-EF&LF         -  70.5  70.6  70.6  70.7  70.8
       CURL-EF_n       67.9  69.5  70.4  70.7  70.8  70.9
       CURL-LF_n       70.1  71.7  71.9  72.2  72.3  72.3
       CURL-EF&LF_n       -  71.7  72.1  72.3  72.4  72.4

20     CURL-EF         69.7  69.8  69.8  69.8  69.8  69.8
       CURL-LF         72.1  72.2  72.3  72.4  72.5  72.6
       CURL-EF&LF         -  72.3  72.3  72.4  72.4  72.4
       CURL-EF_n       69.7  71.2  71.7  72.1  72.2  72.3
       CURL-LF_n       72.1  73.3  73.6  73.7  73.7  73.7
       CURL-EF&LF_n       -  73.3  73.6  73.7  73.8  73.8


Self-taught scenario

                         # co-train round
# img  method             0     1     2     3     4     5
1      CURL-EF         32.4  34.8  35.2  35.3  35.3  35.3
       CURL-LF         36.7  38.6  39.6  39.7  39.8  39.9
       CURL-EF&LF         -  38.4  38.9  38.9  39.0  39.0
       CURL-EF_n       32.4  32.4  32.8  32.8  32.8  32.3
       CURL-LF_n       36.7  34.9  34.8  34.8  34.6  34.2
       CURL-EF&LF_n       -  35.5  35.5  35.6  35.4  35.0

2      CURL-EF         41.0  42.4  42.5  42.6  42.6  42.7
       CURL-LF         45.4  47.1  47.6  47.5  47.7  47.7
       CURL-EF&LF         -  46.4  46.5  46.6  46.7  46.7
       CURL-EF_n       41.0  41.2  42.7  42.5  41.6  41.4
       CURL-LF_n       45.4  44.3  44.2  43.7  43.1  42.3
       CURL-EF&LF_n       -  44.8  45.1  44.7  43.7  43.0

3      CURL-EF         48.5  49.4  49.5  49.5  49.5  49.6
       CURL-LF         53.4  54.0  54.5  54.4  54.5  54.5
       CURL-EF&LF         -  53.5  53.6  53.6  53.6  53.6
       CURL-EF_n       48.5  49.9  50.6  50.6  49.8  49.3
       CURL-LF_n       53.4  52.0  52.3  52.0  51.6  51.1
       CURL-EF&LF_n       -  52.8  52.9  52.6  51.9  51.3

5      CURL-EF         55.2  55.4  55.5  55.5  55.5  55.4
       CURL-LF         59.4  59.8  60.0  60.1  60.1  60.1
       CURL-EF&LF         -  59.3  59.4  59.4  59.4  59.4
       CURL-EF_n       55.2  56.6  57.5  57.5  57.2  57.1
       CURL-LF_n       59.4  59.5  59.6  59.6  59.4  59.2
       CURL-EF&LF_n       -  59.8  59.8  59.7  59.3  59.1

10     CURL-EF         62.9  63.1  63.1  63.1  63.1  63.1
       CURL-LF         66.0  66.4  66.7  66.7  66.9  67.0
       CURL-EF&LF         -  66.6  66.7  66.7  66.7  66.8
       CURL-EF_n       62.9  64.7  65.7  66.0  66.1  66.3
       CURL-LF_n       66.0  67.1  67.3  67.5  67.6  67.6
       CURL-EF&LF_n       -  67.2  67.6  67.7  67.8  67.8

15     CURL-EF         67.3  67.5  67.5  67.5  67.5  67.5
       CURL-LF         70.5  70.8  70.9  71.1  71.2  71.3
       CURL-EF&LF         -  71.0  71.0  71.1  71.2  71.2
       CURL-EF_n       67.3  69.2  69.8  70.1  70.2  70.2
       CURL-LF_n       70.5  70.8  71.1  71.3  71.5  71.5
       CURL-EF&LF_n       -  71.3  71.5  71.7  71.8  71.8

20     CURL-EF         69.1  69.2  69.2  69.2  69.2  69.2
       CURL-LF         72.3  72.4  72.4  72.5  72.6  72.8
       CURL-EF&LF         -  72.6  72.6  72.7  72.7  72.8
       CURL-EF_n       69.1  70.8  71.5  71.7  71.8  71.9
       CURL-LF_n       72.3  72.4  72.7  72.8  72.8  72.8
       CURL-EF&LF_n       -  72.9  73.2  73.2  73.3  73.3


A. Performance across co-training rounds 

Here we analyze in more detail the performance
of our strategy across the five co-training rounds. Results
are reported in Fig. 4, with lines of increasing
color saturation corresponding to rounds one to five.
CURL-LF(EP+LR) is reported in red lines, while
CURL-LF_n(EP+LR) in blue. Results are reported in terms of
MAP improvements with respect to EP+LR, which, we
recall, corresponds to CURL-EF(EP+LR) with zero
co-training rounds. For CURL-LF(EP+LR), performance
always increases with the number of rounds. For
CURL-LF_n(EP+LR), this is not true on the Scene-15 data set
with a small number of labeled examples. In
CURL-LF_n(EP+LR) each round of co-training adds all the
promising unlabeled samples, with a high chance of
including some of them with the wrong pseudo-label.
This may result in a ‘concept drift’, with the classifiers
being pulled away from the concepts represented by the
labeled examples. This risk is lower on Caltech-101
(which tends to have more homogeneous classes than
Scene-15) and when there are more labeled images. The
original CURL-LF(EP+LR) is more conservative, since
each of its co-training rounds adds a single image per
class. As a result, increasing the rounds usually increases
MAP and never decreases it by an appreciable amount.
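
The difference between the two selection policies discussed above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `select_pseudo_labels`, the `scores` matrix of per-class confidences, and in particular the use of a fixed `threshold` to decide which samples count as "promising" are assumptions made for the example.

```python
import numpy as np

def select_pseudo_labels(scores, mode="one_per_class", threshold=0.5):
    """Pick unlabeled samples to pseudo-label from one view's confidences.

    scores: array of shape (n_unlabeled, n_classes); higher means more confident.
    Returns a list of (sample_index, class_index) pairs.
    """
    scores = np.asarray(scores, dtype=float)
    if mode == "one_per_class":
        # Conservative policy: the single most confident sample for each class.
        return [(int(np.argmax(scores[:, c])), c) for c in range(scores.shape[1])]
    if mode == "all_promising":
        # Aggressive policy: every sample whose best score exceeds the
        # (hypothetical) threshold, labeled with its best-scoring class.
        best = scores.argmax(axis=1)
        keep = np.flatnonzero(scores.max(axis=1) > threshold)
        return [(int(i), int(best[i])) for i in keep]
    raise ValueError(f"unknown mode: {mode}")

# Toy confidences for four unlabeled samples and two classes.
toy_scores = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.45, 0.55]]
conservative = select_pseudo_labels(toy_scores, "one_per_class")
aggressive = select_pseudo_labels(toy_scores, "all_promising", threshold=0.5)
```

On the toy data the conservative policy adds two samples and the aggressive one adds all four, including the two borderline ones (0.6 and 0.55). Borderline samples of this kind are the ones most likely to carry a wrong pseudo-label when few labeled examples are available.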

We observed the same behavior for CURL-EF(EP+LR)
and CURL-EF_n(EP+LR). We omit the
corresponding figures for the sake of brevity.

The plots confirm that CURL-LF(EP+LR) is better
suited for small sets of labeled images, while
CURL-LF_n(EP+LR) is to be preferred when more labeled
examples are available. The representation learned from
late fused features explains part of the effectiveness of
CURL. In fact, even CURL-LF(EP+LR) without
co-training (zero rounds) outperforms the baseline
represented by Ensemble Projection.
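
As a worked check of the last statement, the round-0 values of Table II (Scene-15, inductive scenario) can be compared directly. The numbers below are copied from the table, one per group of labeled images in increasing order; the variable names are only illustrative.

```python
# Round-0 MAP values from Table II (Scene-15, inductive scenario),
# one entry per table group, in order of increasing number of labeled images.
map_ef_round0 = [38.3, 44.6, 50.8, 55.6, 62.8, 66.0, 67.8]  # CURL-EF, i.e. the EP+LR baseline
map_lf_round0 = [40.0, 48.5, 54.1, 58.3, 66.2, 69.2, 71.1]  # CURL-LF without co-training

# MAP improvement of late fusion over the baseline before any co-training round.
improvement = [round(lf - ef, 1) for lf, ef in zip(map_lf_round0, map_ef_round0)]
print(improvement)  # [1.7, 3.9, 3.3, 2.7, 3.4, 3.2, 3.3]
```

The gain is positive in every group, which is consistent with the claim that part of the improvement comes from the late-fused representation itself rather than from co-training.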

B. Leveraging CNN features in CURL 

TABLE III

Mean Average Precision (MAP) of the CURL variants, in the (EP+LR) embodiment, varying the number of labeled
images per class at the different co-training rounds, obtained on the Caltech-101 data set in the three learning
scenarios considered: inductive (left), transductive (middle), and self-taught (right). For clarity, the (EP+LR)
suffixes have been omitted.

Inductive scenario

                         # co-train round
# img  method             0     1     2     3     4     5
1      CURL-EF          8.3  10.4  10.5  10.5  10.5  10.5
       CURL-LF         10.1  11.6  11.8  11.8  11.8  11.8
       CURL-EF&LF         -  11.5  11.6  11.6  11.6  11.6
       CURL-EF_n        8.3   8.3   8.5   8.5   8.8   8.8
       CURL-LF_n       10.1  10.3  10.3  10.6  10.7  10.7
       CURL-EF&LF_n       -   9.5   9.6   9.8  10.0  10.1

2      CURL-EF         12.6  14.2  14.3  14.3  14.3  14.2
       CURL-LF         15.3  16.3  16.3  16.3  16.2  16.2
       CURL-EF&LF         -  15.9  15.9  15.8  15.8  15.8
       CURL-EF_n       12.6  12.6  13.4  13.5  13.9  14.1
       CURL-LF_n       15.3  15.7  15.9  16.1  16.2  16.3
       CURL-EF&LF_n       -  14.6  15.1  15.3  15.5  15.7

3      CURL-EF         15.5  16.8  16.9  16.8  16.9  16.9
       CURL-LF         18.8  19.4  19.6  19.6  19.7  19.6
       CURL-EF&LF         -  18.9  19.0  18.9  18.9  18.9
       CURL-EF_n       15.5  15.9  16.8  16.9  17.1  17.1
       CURL-LF_n       18.8  19.5  19.8  20.0  20.0  20.0
       CURL-EF&LF_n       -  18.5  19.0  19.2  19.3  19.2

5      CURL-EF         19.4  20.2  20.4  20.4  20.4  20.4
       CURL-LF         23.2  23.4  23.7  23.7  23.7  23.7
       CURL-EF&LF         -  22.9  23.0  23.0  23.0  23.0
       CURL-EF_n       19.4  20.3  21.0  21.2  21.3  21.3
       CURL-LF_n       23.2  24.2  24.4  24.6  24.6  24.6
       CURL-EF&LF_n       -  23.3  23.7  23.9  23.9  24.0

10     CURL-EF            -     -     -     -     -     -
       CURL-LF            -     -     -     -     -     -
       CURL-EF&LF         -     -     -     -     -     -
       CURL-EF_n          -     -     -     -     -     -
       CURL-LF_n          -     -     -     -     -     -
       CURL-EF&LF_n       -     -     -     -     -     -

In this further experiment we test whether the
proposed classification strategy works when more powerful
features are used. Recent results indicate that the
generic descriptors extracted from pre-trained Convolutional
Neural Networks (CNN) are able to obtain consistently
superior results compared to the highly tuned
state-of-the-art systems in all the visual classification
tasks on various datasets [57]. We extract a
4096-dimensional feature vector from each image using the
Caffe [60] implementation of the deep CNN described
by Krizhevsky et al. The CNN was discriminatively
trained on a large dataset (ILSVRC 2012) with image-level
annotations to classify images into 1000 different
classes. Briefly, a mean-subtracted 227 × 227 RGB image
is forward propagated through five convolutional layers
and two fully connected layers. Features are obtained
by extracting activation values of the last hidden layer.
More details about the network architecture can be found 
in 

We leverage the CNN features in CURL using them as
a fourth feature in addition to the three (GIST, PHOG,
and LBP) used so far. The discriminative power of these
CNN features alone can be seen in Fig. 5, where their 2D
projections obtained by applying the t-SNE method are reported.

The experimental results using the four features are
reported in Fig. 6 for both the Scene-15 and Caltech-101
data sets. We report the results in the transductive
scenario only. It can be seen that the results using the
four features are significantly better than those using
only three features, mainly due to the discriminative
power of the CNN features. Furthermore, the CURL
variants achieve better results than the baselines. This
suggests that CURL is able to effectively leverage both
low/mid-level features such as LBP, PHOG and GIST, and
more powerful features such as CNN.

C. Second embodiment of CURL using LapSVM 

Fig. 4. Performance obtained by CURL-LF(EP+LR) and CURL-LF_n(EP+LR) varying the number of co-training rounds (from 0 to 5).
Panels: Scene-15 and Caltech-101, each in the inductive, transductive, and self-taught scenarios; x-axis: number of labeled
images per class. Performance is reported in terms of MAP improvement with respect to Ensemble Projection (EP+LR). Due to the
small cardinality of some classes, inductive learning on the Caltech-101 has been limited to five labeled images per class.

Fig. 5. 2D projections for the CNN features on the two data sets
used: Scene-15 (left) and Caltech-101 (right). Different classes are
represented in different colors.

In this section we evaluate the performance of CURL
in a different embodiment. Specifically, we
substitute the EP and LR components with LapSVM-based
ones. In LapSVM, first, an unsupervised geometrical
deformation of the feature kernel is performed.
This deformed kernel is then used for classification by a
standard SVM, thus bypassing an explicit definition of
a new feature representation. In this CURL embodiment
we exploit the unsupervised step as a surrogate of the URL
component, and the SVM as the C component. The EF view
is obtained by concatenating the GIST, PHOG, LBP and
CNN features and generating the corresponding kernel,
while the LF one is obtained by a linear combination
of the four kernels computed on each feature. This is
similar to what is done in multiple kernel learning.
Due to its performance in the previous experiments,
the same kernel is used for both views. The experimental
results on the Scene-15 and Caltech-101 data sets in the
transductive scenario are reported in Fig. 7. We named
the variants of this CURL embodiment by adding the
suffix (LapSVM). It can be seen that the behavior of the
different methods is the same as in the previous plots, with
the LapSVM-based CURL outperforming the standard
LapSVM. The plots confirm that CURL-LF(LapSVM)
is better suited for small sets of labeled images, while
CURL-LF_n(LapSVM) is to be preferred when more
labeled examples are available.
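
The two kernel views used in this embodiment (a kernel on the concatenated features for EF, and a linear combination of per-feature kernels for LF) can be sketched as below. The sketch makes several assumptions that are not taken from the text: an RBF kernel with `gamma = 1 / dimension`, uniform combination weights, and random toy features standing in for GIST, PHOG, LBP and CNN descriptors. The LapSVM kernel deformation itself is not shown.

```python
import numpy as np

def rbf_kernel(X, gamma=None):
    """RBF Gram matrix of the rows of X (the kernel choice is an assumption)."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances, clipped at zero against rounding errors.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if gamma is None:
        gamma = 1.0 / X.shape[1]
    return np.exp(-gamma * d2)

def early_fusion_kernel(features):
    """EF view: concatenate all feature blocks, then compute a single kernel."""
    return rbf_kernel(np.hstack(features))

def late_fusion_kernel(features, weights=None):
    """LF view: one kernel per feature block, combined linearly."""
    kernels = [rbf_kernel(F) for F in features]
    if weights is None:
        # Uniform weights are a placeholder, not the authors' setting.
        weights = np.full(len(kernels), 1.0 / len(kernels))
    return sum(w * K for w, K in zip(weights, kernels))

# Toy stand-ins for four descriptors computed on the same six images.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(6, d)) for d in (4, 3, 5, 8)]
K_ef = early_fusion_kernel(feats)
K_lf = late_fusion_kernel(feats)
```

Both matrices have one row and one column per image and can be passed to any kernel classifier. They differ in how the descriptors interact: in the EF kernel all dimensions contribute to a single distance, while in the LF kernel each descriptor contributes through its own similarity.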

In Figs. 9 and 10, qualitative results for the ‘Panda’
class of the Caltech-101 data set are reported: the results
are relative to the case in which a single instance is
available for training and one single example is added
at each co-training round (i.e. each pair of rows
corresponds to CURL-EF(LapSVM) and CURL-LF(LapSVM),
respectively). The left part of Fig. 9 contains the training
examples that are added by CURL-EF(LapSVM) and
CURL-LF(LapSVM) at each co-training round, while the
right part and Fig. 10 contain the first 40 test images
ordered by decreasing classification confidence. Samples
belonging to the current class are surrounded by a green
bounding box, while a red one is used for samples
belonging to other classes.

In the sets of training images, it is possible to see that
after the first co-training round, CURL-LF(LapSVM)
selects new examples to add to the training set, while
CURL-EF(LapSVM) adds examples selected by
CURL-LF(LapSVM) in the previous round. This is a pattern
that we found to occur also in other categories when
very small training sets are used.

Fig. 6. Mean Average Precision (MAP) varying the number of labeled images per class, obtained on the Scene-15 data set (left), and on
the Caltech-101 data set (right), in the transductive scenario. Results are obtained using GIST, PHOG, LBP and CNN features.

Fig. 7. Mean Average Precision (MAP) varying the number of labeled images per class, obtained on the Scene-15 data set (left), and on
the Caltech-101 data set (right), in the transductive scenario. Results are obtained using GIST, PHOG, LBP and CNN features.

In the sets of test images, it is possible to see that
more and more positive images are recovered. Moreover,
we can see how the images belonging to the correct
class tend to be classified with increasing confidence
and move to the left, while the confidences of images
belonging to other classes decrease and are pushed to
the right.
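
The effect described above, correct images moving towards the top of the confidence ranking, is what Average Precision measures. The sketch below uses the common non-interpolated definition of AP, which is an assumption since the exact variant is not stated here; the function names and the toy scores are illustrative only.

```python
def average_precision(scores, is_positive):
    """Non-interpolated AP of one class, ranking samples by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if is_positive[i]:
            hits += 1
            precisions.append(hits / rank)  # precision at each positive sample
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(scores_by_class, positives_by_class):
    """MAP: the mean of the per-class AP values."""
    aps = [average_precision(s, p) for s, p in zip(scores_by_class, positives_by_class)]
    return sum(aps) / len(aps)

# Four test images of which the first and third belong to the class.
labels = [True, False, True, False]
ap_before = average_precision([0.9, 0.8, 0.7, 0.6], labels)  # one negative ranked second
ap_after = average_precision([0.9, 0.6, 0.8, 0.5], labels)   # positives ranked first
```

Here `ap_before` is 5/6 and `ap_after` is 1.0: pushing the negative image below both positives is enough to raise the AP of the class.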

D. Large scale experiment 

In this experiment we test the proposed classification
strategy on a large-scale data set, namely
ILSVRC 2012, which contains a total of 1000 different
classes. The experiment is run on the ILSVRC 2012
validation set, since the training set was used to learn the
CNN features. The ILSVRC 2012 validation set, which
contains a total of 50 images for each class, has been
randomly divided into a training and a test set, each
containing 25 images per class. Again, different numbers of
training images per class were tested (i.e. 1, 2, 3, 5, 10,
and 20). The second embodiment of CURL is used in
this experiment.

The experimental results are reported in Fig. 8 and
represent the average performance over ten runs with
random labeled-unlabeled feature splits.
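
A per-class split of this kind can be sketched as follows. The function `split_validation`, its parameters, and the toy image identifiers are hypothetical. The sketch only shows the 25/25 division and the random draw of labeled images, and it leaves open how the remaining images are used as unlabeled data, which depends on the learning scenario.

```python
import random

def split_validation(image_ids_by_class, n_train=25, n_labeled=5, seed=0):
    """Split each class into labeled training, remaining training, and test images."""
    rng = random.Random(seed)
    labeled, rest, test = {}, {}, {}
    for cls, ids in image_ids_by_class.items():
        ids = list(ids)
        rng.shuffle(ids)
        train, test[cls] = ids[:n_train], ids[n_train:]
        labeled[cls], rest[cls] = train[:n_labeled], train[n_labeled:]
    return labeled, rest, test

# Toy data: three classes with 50 image identifiers each.
toy = {c: [f"class{c}_img{i}" for i in range(50)] for c in range(3)}

# Ten runs with different random splits, as in the averaging described above.
runs = [split_validation(toy, n_train=25, n_labeled=5, seed=s) for s in range(10)]
labeled0, rest0, test0 = runs[0]
```

Each run would then train and evaluate on its own split, and the reported value is the mean of the ten MAP scores.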

Given the large range of MAP values, the plot of MAP
improvements with respect to the LapSVM baseline is also
reported. It can be seen that the behavior is similar to that
of the previous plots, with the LapSVM-based CURL
variants outperforming LapSVM. As for the previous
data sets, the plots show that CURL-EF(LapSVM) and
CURL-LF(LapSVM) are better suited for small sets of
labeled images, while CURL-EF_n(LapSVM) and
CURL-LF_n(LapSVM) are to be preferred when more labeled
examples are available. It is remarkable that the proposed
classification strategy is able to improve the results of
LapSVM, since the CNN features were specifically
learned for the ILSVRC 2012.


VI. Conclusions 

In this work we have proposed CURL, a semi-supervised
image classification strategy which exploits
unlabeled data in two different ways: first, two image
representations are obtained by unsupervised learning;
then, co-training is used to enlarge the labeled training
set of the corresponding classifiers. The two image
representations are built using two different fusion schemes:
early fusion and late fusion.

The proposed strategy has been tested on the Scene-15,
Caltech-101, and ILSVRC 2012 data sets, and
compared with other supervised and semi-supervised
methods in three different experimental scenarios:
inductive learning, transductive learning, and self-taught
learning. We tested two embodiments of CURL and
several variants differing in the co-trained classifier used
and in the number of pseudo-labeled examples that are
added at each co-training round. The experimental results
showed that the CURL embodiments outperformed the
other state-of-the-art methods included in the
comparisons. In particular, the variants that add a single
pseudo-labeled example per class at each co-training
round performed best in the case of a small
number of labeled images, while the variants adding
more examples at each round obtained the best results
when more labeled data are available.

Moreover, the results of CURL using a combination
of low/mid- and high-level features (i.e. LBP, PHOG,
GIST, and CNN features) outperform those obtained on
the same features by state-of-the-art methods. This
means that CURL is able to effectively leverage less
discriminative features (i.e. LBP, PHOG, GIST) to boost
the performance of more discriminative ones (i.e. CNN
features).

References 

[1] O. Chapelle, B. Schölkopf, A. Zien et al., Semi-supervised
learning. MIT Press, 2006.

[2] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, “Text
classification from labeled and unlabeled documents using EM,”
Machine Learning, vol. 39, no. 2-3, pp. 103-134, 2000.

[3] A. Fujino, N. Ueda, and K. Saito, “A hybrid
generative/discriminative approach to semi-supervised classifier
design,” in Proc. of the National Conf. on Artificial Intelligence,
2005, pp. 764-769.

[4] A. Blum and S. Chawla, “Learning from labeled and unlabeled
data using graph mincuts,” in Proc. 18th Int'l Conf. on Machine
Learning, 2001, pp. 19-26.

[5] O. Chapelle, J. Weston, and B. Schölkopf, “Cluster kernels for
semi-supervised learning,” in Advances in Neural Information
Processing Systems, 2002, pp. 585-592.


[6] T. Joachims, “Transductive inference for text classification using
support vector machines,” in Proc. 16th Int'l Conf. on Machine
Learning, vol. 99, 1999, pp. 200-209.

[7] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold
regularization: A geometric framework for learning from labeled
and unlabeled examples,” The Journal of Machine Learning
Research, vol. 7, pp. 2399-2434, 2006.

[8] A. Blum and T. Mitchell, “Combining labeled and unlabeled
data with co-training,” in Proc. of the 11th annual Conf. on
Computational learning theory, 1998, pp. 92-100.

[9] Z.-H. Zhou and M. Li, “Semi-supervised learning by
disagreement,” Knowledge and Information Systems, vol. 24, no. 3, pp.
415-439, 2010.

[10] G. Iyengar and H. J. Nock, “Discriminative model fusion
for semantic concept detection and annotation in video,” in
Proceedings of the eleventh ACM international conference on
Multimedia, 2003, pp. 255-258.

[11] P. Gehler and S. Nowozin, “On feature combination for
multiclass object classification,” in Computer Vision, 2009 IEEE
12th International Conference on, 2009, pp. 221-228.

[12] P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis,
U. Park, and R. Prasad, “Multimodal feature fusion for robust
event detection in web videos,” in Computer Vision and Pattern
Recognition (CVPR), 2012 IEEE Conference on, 2012, pp.
1298-1305.

[13] Z.-H. Zhou and M. Li, “Tri-training: Exploiting unlabeled data
using three classifiers,” Knowledge and Data Engineering, IEEE
Transactions on, vol. 17, no. 11, pp. 1529-1541, 2005.

[14] M. Li and Z.-H. Zhou, “Improve computer-aided diagnosis
with machine learning techniques using undiagnosed samples,”
Systems, Man and Cybernetics, Part A: Systems and Humans,
IEEE Transactions on, vol. 37, no. 6, pp. 1088-1098, 2007.

[15] Z.-H. Zhou, “When semi-supervised learning meets ensemble
learning,” Frontiers of Electrical and Electronic Engineering in
China, vol. 6, no. 1, pp. 6-16, 2011.

[16] M. Wang, X.-S. Hua, L.-R. Dai, and Y. Song, “Enhanced
semi-supervised learning for automatic video annotation,” in
IEEE Int'l Conf. on Multimedia and Expo, 2006, pp. 1485-1488.

[17] S. Gupta, J. Kim, K. Grauman, and R. Mooney, “Watch, listen &
learn: Co-training on captioned images and videos,” in Machine
Learning and Knowledge Discovery in Databases, 2008, pp.
457-472.

[18] A. Levin, P. Viola, and Y. Freund, “Unsupervised improvement
of visual detectors using cotraining,” in Proc. of IEEE Int'l
Conf. on Computer Vision, 2003, pp. 626-633.

[19] C. Christoudias, K. Saenko, L. Morency, and T. Darrell,
“Co-adaptation of audio-visual speech and gesture classifiers,” in
Proc. of the Int'l Conf. on Multimodal interfaces, 2006, pp.
84-91.

[20] H. Feng and T.-S. Chua, “A bootstrapping approach to
annotating large image collection,” in Proc. of the ACM SIGMM
Int'l Workshop on Multimedia Information Retrieval, 2003, pp.
55-62.

[21] H. Bhatt, S. Bharadwaj, R. Singh, M. Vatsa, A. Noore, and
A. Ross, “On co-training online biometric classifiers,” in Int'l
Joint Conf. on Biometrics, 2011, pp. 1-7.

[22] S. Tong and E. Chang, “Support vector machine active learning
for image retrieval,” in Proc. of ACM Int'l Conf. on Multimedia,
2001, pp. 107-118.

[23] M. Guillaumin, J. Verbeek, and C. Schmid, “Multimodal
semi-supervised learning for image classification,” in IEEE Conf. on
Computer Vision and Pattern Recognition, 2010, pp. 902-909.

[24] O. Javed, S. Ali, and M. Shah, “Online detection and
classification of moving objects using progressively improving
detectors,” in Computer Vision and Pattern Recognition, 2005.



Fig. 8. Mean Average Precision (MAP) varying the number of labeled images per class, obtained on the ILSVRC 2012 data set in the
transductive scenario: MAP values (left) and MAP improvements over the LapSVM baseline (right). Results are obtained using GIST,
PHOG, LBP and CNN features.

Fig. 9. Qualitative results of the proposed strategy for the ‘Panda’ class of the Caltech-101 data set over five co-training rounds. Training
images are on the left; the first 17 test images, ordered by decreasing classification confidence, are on the right. Test images from 18 to 40
are reported in Fig. 10.

Fig. 10. Qualitative results of the proposed strategy for the ‘Panda’ class of the Caltech-101 data set over five co-training rounds. The
images are ordered by decreasing classification confidence. Training images and test images from 1 to 17 are reported in Fig. 9.



































